PERF: pd.Index.is_all_dates doesn't require full-scan inference #6341
fyi...this is a cached attribute so it is only hit once, but it is valid; the infer_dtype routines do perform this optimization, however
I know that. Unfortunately, the cache doesn't help in case of slicing: the result is always a new object (unless the slice was itself cached), the cache is not calculated yet and the value is required during construction unconditionally (
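To make the slicing point concrete, here is a minimal pure-Python sketch. It is not pandas code: `ToyIndex`, its `scan_count` counter, and the use of `functools.cached_property` are all hypothetical stand-ins chosen for illustration. It only shows that a per-object cache does not carry over to the new object a slice produces.

```python
from datetime import datetime
from functools import cached_property


class ToyIndex:
    """Hypothetical stand-in for an index with a cached is_all_dates attribute."""

    scan_count = 0  # class-wide counter of scans, for demonstration only

    def __init__(self, values):
        self.values = list(values)

    @cached_property
    def is_all_dates(self):
        # Computed once per object; later lookups on the SAME object are free.
        ToyIndex.scan_count += 1
        return bool(self.values) and all(
            isinstance(v, datetime) for v in self.values
        )

    def __getitem__(self, key):
        # Assumption: only slice keys are supported in this sketch.
        # A slice yields a new object, so its cache starts out empty.
        return ToyIndex(self.values[key])


idx = ToyIndex(str(i) for i in range(10000))
first = idx.is_all_dates       # triggers a scan
second = idx.is_all_dates      # served from the cache, no scan
sliced = idx[10:20]
third = sliced.is_all_dates    # new object: scans again
```

After these lines `ToyIndex.scan_count` is 2: one scan for `idx` and one for `sliced`, even though `idx` had already cached its answer.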
I'm not sure what you mean.
look at the impl of
Hmmm... can it be that you confuse these two cases:
In [50]: arr_s = np.array(map(str, range(10000)))
In [51]: arr_o = np.array(map(str, range(10000)), dtype=object)
In [52]: timeit pd.lib.infer_dtype(arr_s)
1000000 loops, best of 3: 351 ns per loop
In [53]: timeit pd.lib.infer_dtype(arr_o)
10000 loops, best of 3: 42 µs per loop
In [54]: pd.lib.infer_dtype(arr_s)
Out[54]: 'string'
In [55]: pd.lib.infer_dtype(arr_o)
Out[55]: 'string'
Because, I cannot see how
elif PyString_Check(val):
    if is_string_array(values):
        return 'string'
no that is correct. The Pls run a perf-check with the change and see if anything changes. and post the results
Here's the result. Is there a benchmark for slice indexing?
look at so you prob need to create a benchmark which tests this change....nothing is obviously improved (but it's a worthwhile change though) you can use the example from above that you did
Done and updated the gist.
nice! ok just need an entry in release notes (ref this pr as there is no associated issue) and good 2 go
The patch avoids scanning the whole slice in Index.is_all_dates check.
Done.
@immerrr thanks..... sorry about being pedantic....but expect a lot of contributions from you!
While working on #6328 I've stumbled upon this piece that deserves optimization.
The current implementation of Index.is_all_dates invokes lib.infer_dtype, which does a full-scan dtype inference that is not always necessary. If I read the Cython sources correctly, the new code is equivalent to the old one, and the performance increase comes from short-circuiting the check on the first non-datetime element. This change cuts roughly 50% of the runtime of the following snippet:
On recent master:
On this branch:
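The short-circuiting idea from the description can be sketched in plain Python. This is an illustration only, not the Cython code from the patch: both function names are hypothetical, and treating an empty input as "not all dates" is an assumption made for this sketch.

```python
from datetime import datetime


def all_dates_full_scan(values):
    """Classify every element first, then decide (always touches all values)."""
    flags = [isinstance(v, datetime) for v in values]
    return len(flags) > 0 and all(flags)


def all_dates_short_circuit(values):
    """Return as soon as the first non-datetime element is seen."""
    if len(values) == 0:
        return False  # assumption: empty input does not count as all dates
    for v in values:
        if not isinstance(v, datetime):
            return False  # no need to look at the remaining elements
    return True


strings = [str(i) for i in range(10000)]
dates = [datetime(2014, 2, d) for d in range(1, 15)]
mixed = dates + ["not a date"]
```

Both functions give the same answer on `strings`, `dates`, and `mixed`. The difference is that on `strings` the second one stops after inspecting a single element, which is the case that matters for a string-labelled index being sliced repeatedly.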